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Abstract 

A new dimension reduction method based on Gaussian finite mixtures is proposed as an 
extension to sliced inverse regression (SIR). The model-based SIR (MSIR) approach allows 
the main limitation of SIR to be overcome, i.e., failure in the presence of regression symmetric 
relationships, without the need to impose further assumptions. Extensive numerical studies 
are presented to compare the new method with some of most popular dimension reduction 
methods, such as SIR, sliced average variance estimation, principal Hessian direction, and 
directional regression. MSIR appears sufficiently flexible to accommodate various regression 
functions, and its performance is comparable with or better, particularly as sample size 
grows, than other available methods. Lastly, MSIR is illustrated with two real data examples 
about ozone concentration regression, and hand-written digit classification. 

Keywords: dimension reduction, sliced inverse regression, mixture modeling, summary plots. 


1 Introduction 


The general aim of a regression analysis is to understand how the conditional cumulative 
distribution function (cdf) F(y|Ai) of a response variable Y varies as a set of p predictors 
X = {X\,X 2 , ■ ■. ,Xp)^ varies. Attention is often directed to the mean function E(y|Ai) and 
to the variance function Var(y|JA). Suppose that d < p linear combinations of the predictors 
exist such that we can write: 


F{Y\X) = F{Y\pJX,P^ X,... ,/3jX) = F{Y\B ' X 


T ■ 


( 1 ) 


where B = {Pi, ^ 2 : ■ ■ ■ i f^d) is a (p x d) matrix of rank(S) = d. If ([^ holds, then Y is 
independent of X given B^X, and we write Y _1LX|S^X. The structural dimension of a 
regression is defined as the smallest number of distinct linear combinations of the predictors 
required to characterize the regression of T on X. Equivalently, we can say that the subspace 
S{B) spanned by the columns of B is the dimension-reduction subspace (DRS) for the regression 
of y on X. It always exists, since we can trivially set B = I but, in this case, we do not reduce 
the dimension, as the aim is to reduce the dimensionality of the problem as much as possible. 
A minimum DRS has the property of having minimum dimension among all the DRSs for 
the regression of y on X. It can be shown that a minimum DRS may not be unique (of 
course, when several of such subspaces exist, they all have the same dimension). To avoid 
such non-uniqueness, the central dimension-reduction subspace (CDRS) has been defined as the 


intersection over all DRSs. If a CDRS exists, then it is the unique minimum DRS (Cook, 1998 


Chap. 6). Every plot of Y over a CDRS is called sufficient summary plot. If we plot Y over a 
minimum CDRS, we obtain a minimal sufficient summary plot which will contain all the sample 
information available in the data about E(y|X). 

The aim of dimension reduction methods is to estimate the central subspace without es¬ 
timating, or even assuming, a response model, and without strong assumptions on the form 
of the dependence between Y and X. Several methods have been proposed to estimate the 
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CDRS, such as sliced inverse regression (SIR; Li 1991), principal Hessian directions (PHD; [Lij 


1992), sliced average variance estimation (SAVE; Cook and Weisberg, 1991), parametric inverse 


regression (PIR;|Bura and Cook 2001), directional regression (DR; Li et al. 2005) and inverse 


regression estimation (IRE; Cook and Ni, 2005). They are all powerful premodeling tools for 
reducing high-dimensional regression problems by identifying a few linear combinations of the 
original predictors. When the structural dimension of the regression is 1, 2 or perhaps 3, as 
in most practical applications, the reduced dimensionality allows for effective visualization of 
data, and also greatly facilitates model building, particularly for non-parametric modeling. 

In this paper, we propose a new dimension reduction method based on finite Gaussian 
mixture models (GMM). The proposal is an extension of SIR, which allows us to avoid the 
limitations of the basic SIR procedure without imposing further conditions. The next section 
presents the model-based SIR (MSIR) method, which is then illustrated with simulated data 
sets and its behavior compared with other dimension reduction methods. The consistency and 
sensitivity of MSIR are also discussed. Section 3 deals with determining the dimensionality 
of the central subspace: two methods are discussed, a sequential test procedure and a BIC- 
type criterion. Section 4 analyses two real data examples: the first regards regression of ozone 
concentration levels on some primary pollutants and atmospheric conditions, and the second 
deals with the classification of hand-written digits. The final section presents some concluding 
remarks. 


2 Model-based sliced inverse regression 

2.1 Motivation 


Sliced inverse regression (SIR) is one of the first and perhaps the most popular dimension 
reduction method. Li ( |1991 ) showed that, in certain conditions, an estimate of the basis of 
GDRS can be obtained by the first d eigenvectors of the decomposition of Var(E(X|y)) with 
respect to Var(X). 

SIR requires the linearity condition and the coverage condition. The linearity condition 
concerns the marginal distribution of the predictors, i.e., X\B^X) must be linear in 

B^X for all a G DM ( |l99l[ ) emphasized that this condition is not a severe restriction, 
since most low-dimensional projections are close to being normal. With a fixed d, it holds 
approximately as p —)• oo (Hall and Li, 1993). In addition, the condition is required to hold 


only for the basis B of the GDRS. Since B is unknown, in practice it is required to hold for 
all possible B, which is equivalent to the elliptical symmetry distribution (such as multivariate 



may help when gross non-linearities are present. 

The coverage condition requires a method to recover all of the central subspace, not just 
part of it. In the context of ([^, this condition is equivalent to requiring that Gov(y, X) ^ 0 
( ]Yin and Cook 2005). It is well-known that SIR directions span at least a part of the CDRS 
(Cook, 1998, Prop. 10.1). This because SIR gains information from the variation in the inverse 
mean function but fails when symmetric dependencies are present; this is a case of violation of 
the coverage condition. 


Example Let us consider the simple model Y = X^-l-X^, where predictors X = (Ai, A 2 , A 3 , A 4 ) 
are sampled from A 4 ( 0 ,/ 4 ) distribution; for the sake of simplicity, no error term is included. 
The true dimension reduction subspace is spanned by (1,0,0,0) and (0,1,0,0), but SIR can 
only find the first direction, since E(Aj|y) = 0 for j = 2,3, 4. 
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2.2 Method 


SIR estimation is based on the information provided by the inverse regression mean function 
E(X|y). In practice, for a continuous response variable, the range of Y is sliced into H non¬ 
overlapping slices Sh, Y = {h : Y G 5/i} for /i = 1,..., fl, so that the number of observations 
in each slice is approximately equal. Then, variation on slice means, = E(X|y = h) for 
h = 1,... ,H, yields the SIR kernel matrix M = Var(E(X|y)), and SIR directions are obtained 
from the generalized eigendecomposition of M with respect to Var(X). The distribution of the 
data within any slice is summarized only by the within-slice means. The underlying assumption 
is that the distribution of the predictors is elliptical and compact. However, it may happen that 
the data follow a more complicated distribution, and important characteristics are lost if we do 
not take this into account. 

A more flexible modeling approach may be pursued by using finite mixtures of Gaussian 
densities to approximate the distribution of the predictors within any slice, and then obtain the 
kernel matrix from the corresponding component means. Let us assume that, for the h-th slice, 
the data can be described as follows: 

Kh 

f{x\Y = h) = fhix) = TThkHx; fJ-hk, ^hk), (2) 

fc=i 

where (p{.) is the multivariate Gaussian density with mean and covariance Ti^k are the 
mixing weights, so that iThk > 0 and Ylk'^^hk = I, and Kh is the number of components of the 
finite mixture. The marginal distribution of the predictors is thus given by: 

H H Kh 

fix) = Y Thfhix) = EE ^hk'Pix , l-lhk: ^hk)t 

h=l h=l fc=l 

where Th = Pr(y G Sh) and Uhk = Th^hk f^hk Y 0, Ylhk^hk = 1) is the weight associated with 
the /c-th component within slice h. The total number of mixture components is AT = 'Ylih=i 
Definition. Consider the kernel matrix: 

H Kh 

h=\ k=\ 

which is given by the covariance matrix of the between-component means and the marginal 
covariance matrix E = n~^ “ r)"*" with /x = Ylh'^k^hkfJ'hk- estimate of 

the CDRS is the solution of the following constrained optimization: 

argmaxg MB, subject to B^YIB = I, 

where B G is the spanning matrix and I is the {d x d) identity matrix. This is solved 

through the generalized eigendecomposition: 

Mvj = XjYiVj vf'YiVi = 1 if j = Z, and 0 otherwise, 

Y I2 ^ ^ 

The eigenvectors corresponding to the first d largest eigenvalues provide a basis for the CDRS, 
.Bmsir = ['*^1 ) • • •) "I’d]- There are at most d = min(p, K — 1) directions which span this subspace, 
and these are the ones which show the maximal variation between component means. When 
only one mixture component is used for each slice, i.e., Kh = 1 for all slices h = 1,..., H, the 
kernel matrix of MSIR is equal to that provided by SIR. 

We call this approach MSIR {Model-based SIR), so that the CDRS is spanned by directions 
-BmsiR) and the projections onto the subspace are defined as Z = 
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Proposition. Each eigenvalue of the eigendecomposition in is given by the variance of the 
between-component means along the corresponding direction of the projection subspace, i.e. 

A, =Var(E(Z,|y*)), Vj = l,...,d, 

where Y* = {k : Y € S^}, being the set made up of mixture components within each slice 
{k = l,...,K). 

Proof. Eor any kernel matrix, we may rewrite the eigendecomposition in Q as MV = HVL. 
Since by definition V'^YiV = /, the diagonal matrix of eigenvalues L = diag(Ai) may be 
expressed as: 

L = V^MV = Var(E(X|y*))'E = Var(E(Z|y*)) = diag(Var(E(Zj|y*)), 

where Z = are the MSIR predictors. Therefore, each eigenvalue is equal to the 

variance of the between-component means along the associated direction. □ 

Eollowing this result, we can interpret the contribution of each direction to the estimation 
of the CDRS. In addition, the directions corresponding to small eigenvalues provide little or no 
information about differences in means within components. Eormal assessment of the number 
of directions required to span the CDRS is discussed in Section 


2.3 Estimation 


MSIR estimation can be pursued by applying the eigendecomposition in (§ with suitable esti¬ 
mates of the unknown matrices M and 51. The usual sample covariance matrix S is used for 
the latter. An estimate M of the kernel matrix is computed from the estimated within-slice 
component means {k = 1 ,..., Kh', h = 1,..., H) obtained by fitting the finite mixture 
models in Q. 

The most popular algorithm to estimate finite mixture parameters is the Expectation- 


Maximization (EM) algorithm (Dempster et al., 1977), which converges to a maximum likelihood 


estimate of the mixture parameters. In the context of finite mixture models, an important point 
is the choice of the correct model (McLachlan and Peel, 2000, Chapter 6). In our case, this 
amounts to choosing both the covariance structure and the number of components. Parsimo¬ 
nious parameterization of the covariance matrices for each component within slice, 'Shk, can be 
achieved by imposing restrictions on such geometric feature as volume, shape and orientation 


of the corresponding hyperellipsoids (Banfield and Raftery, 1993; Celeux and Govaert, 1995) 


This model selection step clearly affects the estimation of means and mixture proportions 
TThk, and thus kernel matrix M. 

One common approach to the problem of model selection in finite mixture modeling is based 
on Bayesian model selection via Bayes factors. Kass and Raftery (1995) showed than an ap¬ 


proximation to the Bayes factor can simply be computed through the Bayesian Information 
Criterion (BIG). This proved to be efficient on practical grounds, particularly for density esti¬ 


mation (Praley and Raftery, 1998, 2002). Alternatively, Biernacki et al. (2000) and Biernacki 


et al. (2006) discussed the use of the Integrated Complete Likelihood (ICL) criterion. 
The algorithm for MSIR estimation may be summarized as follows: 


1. Obtain a sliced version Y of response variable Y using H non-overlapping slices (this 
step is not needed if Y has support on a finite number of points, such as a discrete or a 
categorical variable). 

2. Eit Gaussian finite mixture models with the EM algorithm to approximate the distri¬ 
bution of X\{Y = h) for h = 1,... ,H. The number of components Kh and covariance 
structure 'Shk within each slice are selected by the BIG criterion. 
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3. Compute kernel matrix M from the means estimated for each mixture component within 
slices. 

4. Perform the generalized eigendecomposition of M with respect to the sample covariance 
matrix 5] of the predictors. 

5. The corresponding eigenvectors provide an estimate of the basis of the subspace, and 
are indicated as Smsir = {Pi, ■ ■ ■ ,Pd)i where Pj = Uj/||nj|| for j = 1,... ,d, i.e., each 
direction is scaled to have unit norm. 


Example (continued) Recalling the example discussed at the end of Section 2.1, the left- 
hand graphs in Figure[^show the plots of the response variable vs the first two predictors, which 
correspond to the basis of the subspace, for a sample of n = 400 observations. The vertical 
ticks at the bottom of each graph represent the slice means for H = 5 slices. As can be seen, 
the slice means for the second predictor are almost equal, which is why SIR is prevented from 
recovering this direction. Instead, the MSIR method discussed here is also able to recover the 
second direction. The right-hand graphs in Figure show the plots of the response variable 
vs the first two predictors with the estimated slice components means at the bottom. Now, 
means along the direction of the second predictor are spread out, which enables MSIR to re¬ 
cover the corresponding direction. The estimated coefficients for the basis of the subspace are 
(0.033, 0.998, —0.010, —0.045) and (0.999, —0.034, —0.016, —0.020), with corresponding eigenval¬ 
ues 0.847 and 0.623 (those associated with the null space are 0.107 and 0.027). Thus, the first 
direction can capture the symmetric curve, and the second direction shows the linear trend. 


2.4 Consistency of MSIR estimator 


Li (1991, Section 5) demonstrated the \/n-consistency of the SIR estimator. His arguments 


were based on the consistency of the individual components of the SIR algorithm. In analogy, 
we argue that the MSIR estimator is \/n-consistent. A full asymptotic analysis of the sample 
properties exceeds the scope of this paper, so this section provides a few basic ideas and results. 
Let us consider the population MSIR decomposition matrix in (|^ in the equivalent form 
E is a i/n-consistent estimator of E by the central limit theorem and, provided 

that E is nonsingular, S is also a -y/n-consistent estimator of by the continuous map¬ 

ping theorem. In analogy, M is a \/n-consistent estimator of M. Therefore, the eigenvectors of 

E ALE are -y/n-consistent estimators of the eigenvectors of the population counterpart. 

Figure shows the average maximal angle between the true subspace and the snbspace 
estimated by MSIR as a function o^l/^/n for some settings of the models discussed in Section 3.1 
If \/n-consistency holds, then an approximately linear relationship should be visible in the graph, 
and this is the case for the examples considered. 


3 Simulation studies 


3.1 Estimation accuracy 

In this section we use simulations to examine the ability of MSIR to recover the true subspace, 
and compare its performance with that of other dimension reduction methods, such as SIR, 
SAVE, PHD and DR. To evaluate the accuracy of a dimension reduction method to estimate 
the true CDRS, we made use of the following distance measure (see also Li et ah, 2005). Let 
S{B) and S{B) be two d-dimensional subspaces of spanned respectively by true basis B 
and an arbitrary estimate B. Also let Ps{B), ^siB) corresponding orthogonal projections 

onto S{B) and S{B). These subspaces may be compared through the following measure: 


A{B,B) = IIP, 


5(B) ^S{B) 


= \\B{B'^B 


)-^B'^ - B(B^ 


B) ^B 


1 oT I 


(4) 
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SIR 


MSIR 



Figure 1: Plots of the response variable vs the first two predictors which form the basis of the 
subspace for the model Y = +X|, X = (X^, X 2 , X 3 , X^) ~ A^4(0, i'4). The horizontal dotted 

lines show the cutoff values used for slicing the response variable. Left graphs for SIR: vertical 
ticks at bottom of each plot represent the estimated within-slice means along corresponding 
direction. Right graphs for MSIR: points are marked by different symbols according to mixture 
component within-slice to which they are assigned; in this case, vertical ticks at bottom of each 
plot represent the estimated components within-slice means along corresponding direction. 
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Model 1 Model 3 Model 4 Model 5 



Figure 2: Estimation accuracy of MSIR for checking i/n-consistency. Each graph plots the 
average maximal angle between MSIR estimates and true subspace against Data from 

simulations with p = 10 predictors and parameters a = 0.5 for models 1 and 3, /? = 0.5 for 
model 4, a = 0.5 for model 5. 


where ||.|| is the spectral Euclidean norm, i.e., the maximum singular value (Gentle, 2007). 
Equation 0 measures r naximal angle a between two subspaces of ] 


It can be shown that 


A(S,S) = sin a G [0,1] ([Meyer 


2000 


p. 455). 


In the following, we treat dimension d of the CDRS as fixed. Only some results are shown 
here (tables and graphics of the complete simulation study appear in the Supplementary mate¬ 
rial) . 

Model 1. Consider the following single-index model with a symmetric response curve: 


Y = (0.5/3^X)^ + ae. 


where /3 = (1, —1, 0,..., O)"*", and the predictors and the error term follow independent standard 
normal distributions. It is known that one of the major limitations of SIR arises from the 
presence of symmetric response curves. The left-hand graph in Figure shows a scatterplot of 
the response variable vs the first estimated SIR direction for a sample of n = 200 observations 
on p = 5 predictors, and a = 0.1. The curved mean function is completely absent along this 
projection. Conversely, the direction estimated by MSIR is shown in the right-hand graph, and 
the symmetric relationship with the response variable is clearly visible: note that A(/3 msiR) /5) = 
0.086, which corresponds to an angle of 4.9°, compared with an angle of 88° for SIR. 

MSIR seems to be a great improvement over SIR, but it is also interesting to compare its 
behavior with other dimension reduction methods, particularly PHD and SAVE, which were 
developed to deal with such a situation. Figure shows the results of a simulation study for the 
above symmetric response model with number of predictors p = 10 at various sample sizes (n) 
and error standard deviations (cr). Overall, MSIR is a great improvement over SIR. Compared 
with SAVE and PHD, which are known to work particularly well in the case of symmetric and 
curved relationships, the accuracy of MSIR is comparable when a is small and as sample size 
increases. When a large amount of noise is present and sample size is relatively small, MSIR 
tends to perform slightly less well. However, for less noisy data, the accuracy of MSIR is higher 
than with SAVE, PHD and DR. Note that for this model the accuracy of DR is very similar to 
that of SAVE. 

Model 2. Consider the two-dimensional regression model 

Y = ^JX + {pJxf + ae, 
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SIR 1st direction 


MSIR 1st direction 


Figure 3: Summary plots for single-index regression model with symmetric curve: response 
variable is plotted against first SIR direction (left) and first MSIR direction (right). 
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Figure 4: Simulation results for model 1 : average maximum angle between true and estimated 
subspaces based on 500 simulations for p = 10 predictors at different sample sizes (n) and error 
standard deviation (a). 


where /3i = (1,0,..., O)"*", /32 = (0,1,0,..., O)"*", and the predictors and the error term follow 
independent standard normal distributions. This model has both a linear trend and a symmetric 
quadratic curve along two different directions. We expect SIR to be able to recover the first 
direction but not the second, whereas the opposite is expected for PHD. SAVE and DR should 
be able to recover both directions, but with a different degree of efficiency. 

Figure shows the results from a simulation study based on 500 repetitions for each combi¬ 
nation of sample sizes (n) and error standard deviation (a), with number of predictors p = 10. 
Clearly, MSIR outperforms SIR and PHD in all these settings. Its accuracy is comparable to 
SAVE when p = 5 (see Supplementary material) but, as p increases, MSIR is much better 
than SAVE. The behavior of MSIR and DR are comparable, although DR tends to provide 
slightly better accuracy for small sample sizes, whereas MSIR tends to achieve better accuracy 
as sample size grows. 

Model 3. Consider the following two-dimensional model with response rational function: 


Y = 


0.5 + (1.5 + /3jX)2 


+ (1 + 
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Figure 5: Simulation results for model 2: average maximum angle between true and estimated 
subspaces based on 500 simulations for p = 10 predictors at different sample sizes (n) and error 
standard deviation (a). 


where /3i = (1,0 ,..., 0)"'", (32 = (0,1 ,..., 0)"'", with the predictors and the error term which 
follow independent standard normal distributions. The response surface for this model shows a 
noisy linear trend along the first direction and a strong non-symmetric curve along the second 
direction. 

The simulation results of some dimension reduction methods are shown in Figure Overall, 
MSIR is slightly, but uniformly, more accurate than SIR or DR, which behave similarly, and 
it is much more accurate than SAVE and PHD. MSIR, SIR and DR all improve as sample 
size increases, and the same happens for SAVE, except when the number of predictors is large 
{p = 20, see Supplementary material). PHD performs quite badly for this data-generating 
model. 
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Figure 6: Simulation results for model 3: average maximum angle between true and estimated 
subspaces based on 500 simulations for p = 10 predictors at different sample sizes (n) and error 
standard deviation (a). 

Model 4. To investigate the performance of the MSIR estimator in the case of correlated 
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predictors we consider the following response model: 

y = + e, 

where /3 = (1,1,1,0 ,..., 0)"'' and e ~ A^(0,1), independent of covariates. Predictors vector 
X = (Xi,..., Xp) follows a standard multivariate normal distribution with correlation between 
Xi and Xj given by 

Simulation results are shown in Figure In general, we note that MSIR is uniformly more 
accurate, i.e., it always achieves a smaller angle with the true subspace than the other dimension 
reduction methods. When the predictors are uncorrelated (p = 0), SIR, PHD and DR all provide 
comparable accuracy, whereas SAVE quickly deteriorates as the number of predictors increases 
(see Supplementary material). As the correlation among predictors increases, the improvement 
of MSIR with respect to the other methods becomes larger. DR and PHD show similar behavior, 
but SIR and SAVE appear to be the least efficient methods if highly correlated predictors are 
present. 
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Eigure 7: Simulation results for model 4- average maximum angle between true and estimated 
subspaces based on 500 simulations for p = 10 predictors at different sample sizes (n) and 
correlation coefficient (p). 


Model 5. We now consider the model discussed by Li et al. (2005, Example 6.5), i.e., 

Y = ^{l3'^X-afe, 


(5) 


where /I = (1, 0,..., O)"*", X ~ A^(0, Iio) and e ~ V(0,1), independent of predictors. Here, only 
the variance of Y depends on the predictors and, in particular, it is a quadratic function of Xi 
centered on values a = {0,0.5,1}. Since PHD is not capable of estimating a direction which 


only appears in the variance function (Cook and Li, 2002), we expect PHD to perform poorly 


for this model. This should also happen for SIR when a = 0, since in this case the function is 
symmetric around the origin. 

Figure shows the results of a simulation study based on 500 replications. When a = 
0, SAVE and DR perform very similarly, whereas MSIR improves as sample size increases, 
achieving the smallest angle when n > 500. As expected, in this case, neither SIR or PHD can 
estimate the true subspace. When a increases to 0.5, the performance of SAVE worsens and 
DR achieves the smallest angle for small sample sizes. MSIR closely follows DR and, again, it 
appears to be the best method for large sample sizes. In this case, SIR greatly improves with 
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respect to the previous case, but PHD does not improve at all. When a = 1, SIR achieves the 
best performance for small samples, very closely followed by MSIR and then by DR. SAVE needs 
large sample sizes to achieve comparable accuracy, and PHD is still the worst method. Overall, 
we note that, provided that sample size is moderate to large, MSIR can provide an accurate 
estimate of the dimension reduction subspace in different settings when the dependence only 
appears in the variance function. 
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Figure 8: Simulation results for model 5: average maximum angle between true and estimated 
subspaces based on 500 simulations for p = 10 predictors at different sample sizes (n) and 
constant a. 


3.2 Sensitivity of MSIR algorithm to number of slices 


The number of slices acts as a tuning parameter, like the span width or kernel bandwidth in 
smoothing approaches. Estimation of MSIR, like that of SIR, is not overly sensitive to the choice 
of the number of slices. However, we must ensure a sufficient number of observations within 
any slice to fit finite mixture models. By default, we use H = max(3, \}og 2 {n/^)\) number 
of slices, where [ttj indicates the largest integer not greater than u. The resulting number of 
slices depends on both the amount of data available and the dimension of the predictor space 
(see Figure]^. In order to have a large number of slices, we need either a large sample or a 
small number of predictors; for a fixed number of predictors, the number of slices increases as 
sample size increases. 

One natural concern involves the sensitivity of the MSIR algorithm with respect to tuning 
parameter H. To address this issue, a simulation study was conducted in which, for models 


1“4 described in Section 3.1, we assessed the ability of MSIR to recover the true subspace when 


both sample size and number of slices vary. We set p = 5 for the first three models with a = 0.1, 
and p = 10 with p = 0.5 for model 4. 


Figure 10 shows the results of this simulation study. In general, the behavior of MSIR is 
quite stable, as long as we allow for enough observations within slices. For the first model, when 
n = 100, the distributions are similar up to R = 5, and over H > 5 the angles become very 
large. When n = 200, the break-point is at R = 10, but is at R = 20 when n = 500, and 
at a value larger than 30 for samples of size n = 1000. These characteristics are also found in 
the results for the second and fourth models, the third model shows a more stable distribution 
across values of H. Figure [^indicates that the default number of slices is H = (5, 6, 7,8) when, 
respectively, n = (100, 200, 500,1000) for the first three models, and H = (4, 5, 7,8) for the last 
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Figure 9: Default number of slices used in MSIR algorithm as a function of sample size n and 
number of predictors p. 


model. These values are shown as vertically shaded bars in Figure 10 and seem to provide 
reasonable defaults. 


3.3 Computing time 

Table gives the CPU times (in seconds) required by MSIR and other dimension reduction 
methods for data generated from Model 1 in Section |3.1| with different numbers of predictors 
(p) and sample sizes (n). The calculations are performed in R (R Development Core Team 


2011) with a 2.2 GHz Intel Core 2 Duo Macbook Pro with 2GB RAM. Clearly, MSIR needs 
more computing time than the other methods, particularly as sample size increases. This is 
mainly because MSIR needs to estimate several mixture models via the EM algorithm and to 
perform model selection within each slice, in order to choose the appropriate parameterization 
and number of components. 


Table 1: Comparison of computing times (in seconds) 


p 

n 

SIR 

SAVE 

PHD 

DR 

MSIR 


100 

0.012 

0.012 

0.008 

0.108 

0.186 

10 

500 

0.020 

0.020 

0.013 

0.517 

3.535 


1000 

0.031 

0.031 

0.023 

1.025 

24.350 


100 

0.021 

0.021 

0.010 

0.265 

0.196 

20 

500 

0.037 

0.039 

0.024 

1.218 

3.964 


1000 

0.058 

0.058 

0.042 

2.455 

32.307 
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Figure 10: Angles (degrees) between true subspace and MSIR estimated subspaces for model 


discussed in Section 3.1 as a function of number of slices (II) with increasing sample sizes (n). 
Compare with default values for H, in Figure represented by vertical shading bars. 
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4 Determination of dimension of CDRS 


Assessing the dimension of the CDRS is an important question in any dimension reduction 
method. A plot of Y versus the first few MSIR predictors Zj = where Smsir = 

{Pi, (32 ,. • •), is usually very informative, but inference on the dimension of the CDRS is still 
required. A popular method is based on the sequential chi-square test proposed by Li (1991), 
whereas a more recent approach is based on a BIC-type criterion. In this section, we discuss 
how to apply these two methods in the MSIR case. 


4.1 Permutation test 


Li (1991) proposed a sequential test procedure for SIR based on the statistic 


Ad = ra ^ \j, 
j=d+l 


( 6 ) 


which, under the assumption that the predictors are normally distributed, has an asymptotic 
chi-square distribution with {p — d){H — d — 1) degrees of freedom. In general, chi-square 
asymptotic distribution holds for any distribution of the predictors under the linearity and 


constant covariance conditions (Bura and Cook, 2001). For other dimension reduction methods, 
for instance SAVE, the null distribution of statistic ([^ is unknown, even asymptotically. In 
these cases, and for SIR when the linearity and constant covariance conditions are not satisfied. 


Cook and Weisberg (1991) and Cook and Yin (2001) proposed a general permutation test which 


can be easily adapted to our case. 

Consider partition B = {Bi, B 2 ) of the {p x p) matrix of eigenvectors of population kernel 
matrix M, where Bi = {Pi,... ,P^) and B 2 = {Pd+i^ ■ ■ ■ iPp)- Assume that the independence 
condition between (Y, and SjX holds for testing hypothesis Hq : rank(Af) < d versus 

Hi : rank(Af) > d. The observed test statistic A^ (for d = 0,1,... ,p — l) can be compared to its 
permutation distribution under the null hypothesis. Starting with d = 0, the test procedure is 
performed sequentially. If the null hypothesis is not rejected for a given value of d, then the last 
{p — d) MSIR predictors S J X can be discarded without loss of information on the regression 
of Y on X. Thus, the testing procedure involves the following steps: 

1 . for a given sample kernel matrix M, compute the eigendecomposition in ([^ to obtain 
eigenvectors Bi = {Pi,...,P^) and B 2 = {P^+i,..., Pp), with associated eigenvalues 
Xi,... ,Xd and A^+i, ■ ■ ■ ,Xp; 

2 . compute the observed value of test statistic A^; 

3. obtain the vectors of sample MSIR predictors Zn = BjX.i and Zi 2 

1 ,..., n, ^ ^ 

4. randomly permute indices i of Zi 2 to obtain permuted data Zi* 2 ', 

5. apply the MSIR procedure to original data Yi, Zn and permuted data Zi* 2 , to obtain 
the value of permuted test statistic A^; 

6 . repeat steps 4 and 5 a number of times. The p-value for testing the null hypothesis is 
estimated as the fraction of A^ exceeding A^^. 

For d = 0,1,... ,p — 1, we test rank(A/') sequentially, and estimate d = do if do is such that the 
corresponding p-value is the first one greater than a fixed significance level, say a = 0.05, in the 
series. If we reject all the hypotheses, we conclude that rank(Af) = p. 


= Bj'Ki, for i = 


4.2 BIC-type criterion 


Zhu et al. (2006) and Zhu and Zhu ( 2007| proposed a consistent BIC-type procedure to deter¬ 


mine the dimension of the CDRS. Let = r-l-Jj, and fi = T Yip, where E is the kernel matrix 
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> Op 


be 


for standardized predictors and Ip is the (p x p) identity matrix. Let 9i > O 2 > 
the eigenvalues of fl and 9i > 62 > ■ ■ ■ > Op those of Q. Clearly, 9i = A* + 1, where Aj are the 
eigenvalues of F, and the dimension of the CDRS is given by the number of eigenvalues of ft 


greater than 1. Zhu et al. (2006) showed that a BIC-type criterion can be defined as follows 


G{d) = logLd - C{n,p,d), 

where logL^ = ^ Z]r=i+min(r,d)(^°s(^j) + 1 - Oi), with r denoting the number of Oi > 1, and 
C{n,p,d) is a penalty term which depends on the number of free parameters to be estimated. 
In the original proposal the penalty term was defined as C{n,p, d) = Cnd{2p — d + l)/2, with 
Cn = (0.51og(n) + 0.1n^/^)/(2(n/iL)), where n/H is the average number of data points within 
each slice. However, this definition of the penalty term was based on favorable empirical evidence 
among a candidate set of penalty terms. Later, Zhu and Zhu (2007) noted that the number 


of Oi to be estimated are (p — d), and suggested the use of the penalty C{n,p,d) = —{p — 
d)\og{n). The dimension of the CDRS is then estimated as the maximizer of G{d), i.e. d = 
argmaxQ<^<p_]^ G{d). This BIC-type procedure for selecting the dimension of the CDRS is 

^ ^1/2^ ^1/2 

easily applied to the MSIR approach by setting F = 51 M'S 


4.3 Simulation study 


We conducted a simulation study using the first four models described in Section ^ to in¬ 
vestigate the accuracy of the permutation test (PT) procedure and the BIC-type criterion in 
choosing the correct dimension of the CDRS. Figure [TT] shows the results of these simulations, 
plotting fractions F(i) and based on 500 replications, in which a procedure (PT or BIC) 

selected d = i and d = i 01 d = j versus sample size. To simplify the discussion, only the results 
for case p = 10 are reported. 

For the first model, which has d = 1, the PT procedure tends to select the correct value 
as sample size increases. When sample size is small and there is a large amount of noise, the 
procedure sometimes underestimates the true dimension. The behavior of BIC is similar to that 
of PT, except for when n = 1000 and a = 0.1, in which case it overestimates the dimensionality. 
For the second model, which has d = 2, the BIC-type criterion greatly improves as sample 
size increases, whereas the PT procedure is more accurate for small sample sizes. The noise 
component does not seem to affect the accuracy of either procedures. On the contrary, it 
has a large effect for model 3, which also has d = 2. In this case, both PT and BIC worsen 
as a increases: in particular, they tend to select only one direction as relevant. For the last 
model, where d = 1, the PT procedure performs well, and the BIC-type criterion is comparable 
when the predictors are uncorrelated or very strongly correlated but, if p = 0.5, it tends to 
overestimate the true dimensionality. 

Overall, both procedures provide reliable estimates of the dimension of the CDRS. The 
permutation test procedure is more accurate when sample size is not large, whereas the BIC- 
type criterion is more efficient as sample size increases. 
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Figure 11: Inference about d from simulations for four models described in Section [3.1[ 
F{i,j) are fractions of runs in which estimated d was one of the arguments. 
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5 Data analysis 

5.1 Chicago air pollution data 

Atmospheric pollutants are responsible for serious environmental pollution and may have dan¬ 
gerous effects on public health. Pollutants are often classified as either primary or secondary. 
Primary pollutants are released into the atmosphere during combustion processes of any kind 
(volcanic eruptions, motor vehicle exhausts, etc.), and include carbon monoxide CD, nitrogen 
dioxide N02, sulfur dioxide S02, and particulate matter with diameter smaller than 10 microns 
PMIO. After their release into the atmosphere, primary pollutants are subject to processes 
of diffusion, transport and deposition. They also undergo processes of chemical and physical 
transformation, which may lead to the formation of secondary pollutants. These are formed 
from primary pollutants as a result of changes of various kinds caused by reactions which often 
involve atmospheric oxygen and weather conditions. Of main interest is the ground level of 
ozone (03) which, at abnormally high concentrations, caused by human activities (mainly the 
combustion of fossil fuel) is a dangerous pollutant. 


Table 2: Model-based SIR results for air pollution data. 


Slices 

I 

2 

3 

4 

5 


6 

7 


GMM 

XXX 

EEI 

VVV 

VEI 

VEI 

XXX XXX 

Num. comp. 

I 

4 

2 

3 

3 


1 

1 


Num. obs. 

52 

5|23|7|17 

45|7 

13|31|8 

26|3|23 


52 

51 


Predictors 



Standardized basis 






Diri 

Dir2 

Dirs 

Dir4 Dirs 


Dire 


T 

0.6824 

0.15446 

0.00996 

-0.15674 0.6137 


-0.1132 


H 

-0.1307 

-0.07566 

-0.40980 

0.48761 0.4306 


0.2151 


PMIO 

0.1189 

-0.48158 

-0.38248 

0.38666 -0.4828 


-0.5056 


S02 

-0.1406 

-0.43371 

-0.44994 

-0.59750 0.2859 


0.3660 


N02 

0.6204 

0.37996 

0.13235 

0.47879 -0.2149 


0.6795 


CO 

-0.3136 

-0.63719 

0.68243 

-0.04374 0.2775 


-0.2994 


Eigenvalues 


0.7381 

0.4514 0.1828 

0.1371 

0.09066 0.04821 

Structural dimension 

0 

1 

2 

3 



4 

5 

BIC-type criterion 

-17.77 

9.974 

18.4 

15.21 

10.88 

5.69 

Test statistic 


598.4 

330.4 

166.5 

100.2 

50.41 

17.5 

Permutation 

p-value 

0 

0.01 

0.25 

0.29 


0.36 

0.33 


We considered daily data collected in Chicago in 1997 and available at http: //www. ihapss. 
jhsph.edu/data/data.htm, We aimed at modeling ozone concentration Y on some primary 
pollutants and weather conditions (temperature T and humidity H). The results from MSIR 
estimation are shown in Table the first part of the table lists the type of GMM fitted for 
each slice (for the meaning of symbols, see Fraley and Raftery 2006), the number of mixture 
components, and the number of observations for each within-slice component. The second part 
of the table shows the predictor coefficients, scaled to have standard deviation equal to one, 
associated with the estimated directions. The eigenvalues of the MSIR kernel matrix are also 
shown, together with the BIC-type criterion and the permutation test described in Section 
Both methods indicate a two-dimensional structure. 

The plot of the response variable versus the first two MSIR variates are shown in Figure [T^ 
where smooth functions for mean and variance have been added as described in [Weisber^ 
(2005, pp. 275-278). A rotating 3D plot is also available in the Supplementary material. An 


increasing trend with constant variance is associated with the first MSIR direction, which is 
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mainly determined by predictors T and N02. Thus, an increase in ozone level is associated with 
increasing values of temperature and nitrogen dioxide. The second direction shows a curved 
relationship, with non-constant variance. However, its interpretation is less straightforward: 
there is a positive relationship with T and N02, as in the first direction, but an inverse relationship 
with the other predictors, especially PMIO, S02 and CO. 

When we compare the estimated MSIR directions with those obtained by other dimension 
reduction methods, we can see that the first MSIR variate has E? ~ 0.98 with the first SIR 
variate, and 0.92 with the first DR variate. Therefore, the three methods essentially identify the 
same direction. In contrast, the second MSIR variate has an B? of about 0.5 with the second 
variate estimated by both SIR and DR. Therefore, although these directions are different, they 
all show a heteroskedastic shape. 



MSIR Dir1 


MSIR Dir2 


Figure 12: Summary plots of ground level ozone concentration (Y) versus first two estimated 
MSIR directions, with smooth functions for mean and variance. 


5.2 Pen digit data 


The data for this pattern recognition problem on handwritten digits come from the UCI 
machine-learning repository and contain samples of handwritten digits {0,1 ,...,9} collected 
from 44 different writers. Each digit is stored as a 16-dimensional vector. The data set is 
divided into a training set and a learning set. We focus on the data involving three digits, 
(0, 6, 9}. Because of their similar shape, they are among the most difficult to identify. These 


data were analysed by Zhu and Hastie (2003) by means of several procedures including SIR 


and SAVE, and by Li and Wang (2007) with DR. The latter authors noted that SIR provides 


only locational separation of the three types of digits, whereas their DR method also provides 


a distinction in variation (see Figure 3 of Li and Wang, 2007). 

For this classification problem, the response variable is the class label of each digit. 


We 


applied the proposed MSIR method to the training set made up of 2219 digits. For the group of 
0 digits, the selected GMM was a 9-component mixture with ellipsoidal equal shape covariance 
matrices (VEV). A 7-component GMM was selected for the group of 6 digits, whereas a 5- 
component mixture for the group of 9 digits, both with ellipsoidal equal volume and shape 
covariance matrices (EEV). Figure [T^ shows a static view of a 3D plot of observations projected 
along the first three MSIR directions (for a rotating 3D plot, see the Supplementary material). 
The three groups of digits appear to be well separated by both location and variation, with a 
small separate sub-group of points for digits 9, and some outliers. Comparing this plot with 


Figure 3 of Li and Wang (2007), we note that the main characteristics of the data are retained. 
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but some other features are also visible, such as the more compact shape for the main group of 
9’s, and the elongated, curved cluster of O’s. 



Figure 13: Static view of a 3D plot of handwritten digits projectd along first three MSIR 
directions, with points marked according to digit: o = 0, x = 6, + = 9. 


One advantage of the MSIR approach is that it allows straightforward classification of ob¬ 
servations on the basis of the estimated finite mixtures for each class. In the present case, 
the estimated MSIR model postulates that digits from class h = {0, 6, 9} can be described as 
f{X\Y = h) = Y.k=i^hk<t>{X-,nhk.^hk), with number of components = {9, 7, 5} and co- 
variance matrices Yihk which are parametrized according to models VEV, EEV and EEV 


as 


described in Eraley and Raftery (2006). Thus, we may estimate the probability of obtaining a 
digit /i = {0,6,9}, given predictors X as follows: 


' E HX\Y = l)r,' 

«={0,6,9} 

where ti are the observed fractions of digits I in the sample. Recalling that the CDRS is subspace 
S{B) so that Y _IL XjB^X, the above expression can be expressed equivalently as: 


= /i|Z) 


fjZlY = h)fi 

f(ZlY = l)Ti’ 

l={0,6,9} 


where Z = XB are the MSIR variates and f{Z\Y = h) = Y^^=i^hk<k{Z', B ' p,f^i.,B ' X^kB). 
Observations, from either the training or test sets, can be classified according to the MAP 
principle. By Proposition 1, a classification rule can only be based on a subset of the most 
important directions. Eigure 14 shows the error rates for classifying digits from the training 
and test sets as a function of CDRS dimension. The smallest error rate is achieved when d = 3, 
i.e., when the first three MSIR directions are used. 

Table shows the training and test error rates for some classification methods: (i) classical 
linear discriminant analysis, (ii) discriminant analysis based on Gaussian finite mixture modeling 
( jEraley and Rafte^ 2002), (iii) SIR, obtained by fixing G = 1 for all classes and using the two 
estimable directions, and (iv) MSIR using the first three directions. The training errors are 
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the same for the first three methods, but the test errors are different, as GMMDA achieves the 
smallest value. Classification based on MSIR provides a larger error rate on the training set, 
but the smallest classification error on the test set. Thus, in this case, the classification rule 
based on the MSIR directions appears to be more robust, as it avoids overfitting the training 
set and achieves a good accuracy on the test set. 


Table 3: Classification error rates 
for some classifiers based on 
training and test sets. 

Error rate % 
Classifier Train Test 

LDA 0.18 2.32 

CMMDA 0.18 2.03 

SIR(d = 2) 0.18 2.13 

MSIR(d = 3) 0.32 1.55 


1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 

Dimension 



Figure 14: Classification error rates of MSIR for pen 
digit data set as a function of dimensionality. 


6 Concluding remarks 


In this paper we propose a model-based approach to dimension reduction which yields a more 
flexible version of SIR. This is achieved by modeling the distribution within each slice through 
a finite mixture of Caussian densities. The algorithm for MSIR estimation, determination of 
dimensionality, and some other results are presented. The favorable behavior of MSIR with 
respect to other popular dimension reduction methods are shown through extensive simulation 
studies. In particular, MSIR overcomes the main limitation of standard SIR in dealing with 
symmetric relationships. Compared with SAVE, MSIR is more efficient and has higher accuracy 
in the case of linear trends. Its performance, particularly for correlated predictors, is also 


competitive with, or superior to, that of DR, which is reported by Li and Wang (2007) as the 


most accurate dimension reduction method based on the first two inverse moments. 


Cook and Forzani (2009) recently introduced a likelihood-based dimension reduction method 
under the assumption of conditional normality of predictors given the response. Numerical 
optimization was used for maximization of the log-likelihood on Crassman manifolds. There 
are similarities between the two methods, but also some substantial differences. In particular, 
their proposal assumes X\Y ~ N(fiy, A^), where both mean and covariance matrix depend on 
the response variable. Different structures for ^ly and yield different models. In MSIR, we 
employed the flexibility of hnite mixture of Caussian densities to approximate the distribution 
of X\Y, with data-driven selection of the number of components and the covariance structure. 


Another recent proposal by Wang and Yin (2011) introduces the use of orthogonal series to 


estimate the inverse mean space. The relative merits and a thorough comparison of these 
approaches compared with our proposal is an area for further research. 

In this paper, we deal with the standard setting, in which the number of observations is 
larger than the number of predictors. However, in the case of p ^ n, we need to account for 
possible singularities in the estimation of covariance matrices, arising both from the fitting of 
Caussian mixture models and the marginal distribution of the predictors. This can be done by 
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imposing restrictions on the possible form of covariance structures, i.e., assuming spherical or 
diagonal covariance matrices. 

Finally, we point out that there are some open issues which deserve further study, as, for 
instance, the sensitivity of MSIR to the violation of the linearity condition, the applicability in 
case of high-dimensional predictors, the investigation of other criteria for selecting the mixture 
model parametrization and number of components within slice. 

Supplementary materials including further tables and graphs of simulation results are avail¬ 
able from the author’s web page. An R package called msir implementing the method proposed 
in this paper is available on the Comprehensive R Archive Network at http: //CRAN. R-pro j ect. 
org/package=nis ir, 
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